This data set logs every shot taken during the 2014-2015 NBA regular season between October 2014 and March 2015 (note that this does not cover the entire season), along with a variety of factors relevant to each shot. We obtained the data from https://www.kaggle.com/dansbecker/nba-shot-logs; it was scraped from the NBA’s REST API.
library(tidyverse)  # read_csv(), the pipe, dplyr, tidyr, stringr, ggplot2
shots <- read_csv("shot_logs.csv")
## Rows: 128069 Columns: 21
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): MATCHUP, LOCATION, W, SHOT_RESULT, CLOSEST_DEFENDER, player_name
## dbl (14): GAME_ID, FINAL_MARGIN, SHOT_NUMBER, PERIOD, SHOT_CLOCK, DRIBBLES,...
## time (1): GAME_CLOCK
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
To get the gist of the data set, we use the summary(), str(), and head() functions to inspect its variables and observations.
summary(shots)
## GAME_ID MATCHUP LOCATION W
## Min. :21400001 Length:128069 Length:128069 Length:128069
## 1st Qu.:21400233 Class :character Class :character Class :character
## Median :21400449 Mode :character Mode :character Mode :character
## Mean :21400452
## 3rd Qu.:21400673
## Max. :21400908
##
## FINAL_MARGIN SHOT_NUMBER PERIOD GAME_CLOCK
## Min. :-53.0000 Min. : 1.000 Min. :1.000 Length:128069
## 1st Qu.: -8.0000 1st Qu.: 3.000 1st Qu.:1.000 Class1:hms
## Median : 1.0000 Median : 5.000 Median :2.000 Class2:difftime
## Mean : 0.2087 Mean : 6.507 Mean :2.469 Mode :numeric
## 3rd Qu.: 9.0000 3rd Qu.: 9.000 3rd Qu.:3.000
## Max. : 53.0000 Max. :38.000 Max. :7.000
##
## SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST
## Min. : 0.00 Min. : 0.000 Min. :-163.600 Min. : 0.00
## 1st Qu.: 8.20 1st Qu.: 0.000 1st Qu.: 0.900 1st Qu.: 4.70
## Median :12.30 Median : 1.000 Median : 1.600 Median :13.70
## Mean :12.45 Mean : 2.023 Mean : 2.766 Mean :13.57
## 3rd Qu.:16.68 3rd Qu.: 2.000 3rd Qu.: 3.700 3rd Qu.:22.50
## Max. :24.00 Max. :32.000 Max. : 24.900 Max. :47.20
## NA's :5567
## PTS_TYPE SHOT_RESULT CLOSEST_DEFENDER
## Min. :2.000 Length:128069 Length:128069
## 1st Qu.:2.000 Class :character Class :character
## Median :2.000 Mode :character Mode :character
## Mean :2.265
## 3rd Qu.:3.000
## Max. :3.000
##
## CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM PTS
## Min. : 708 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:101249 1st Qu.: 2.300 1st Qu.:0.0000 1st Qu.:0.0000
## Median :201949 Median : 3.700 Median :0.0000 Median :0.0000
## Mean :159039 Mean : 4.123 Mean :0.4521 Mean :0.9973
## 3rd Qu.:203079 3rd Qu.: 5.300 3rd Qu.:1.0000 3rd Qu.:2.0000
## Max. :530027 Max. :53.200 Max. :1.0000 Max. :3.0000
##
## player_name player_id
## Length:128069 Min. : 708
## Class :character 1st Qu.:101162
## Mode :character Median :201939
## Mean :157238
## 3rd Qu.:202704
## Max. :204060
##
str(shots)
## spec_tbl_df [128,069 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ GAME_ID : num [1:128069] 21400899 21400899 21400899 21400899 21400899 ...
## $ MATCHUP : chr [1:128069] "MAR 04, 2015 - CHA @ BKN" "MAR 04, 2015 - CHA @ BKN" "MAR 04, 2015 - CHA @ BKN" "MAR 04, 2015 - CHA @ BKN" ...
## $ LOCATION : chr [1:128069] "A" "A" "A" "A" ...
## $ W : chr [1:128069] "W" "W" "W" "W" ...
## $ FINAL_MARGIN : num [1:128069] 24 24 24 24 24 24 24 24 24 1 ...
## $ SHOT_NUMBER : num [1:128069] 1 2 3 4 5 6 7 8 9 1 ...
## $ PERIOD : num [1:128069] 1 1 1 2 2 2 4 4 4 2 ...
## $ GAME_CLOCK : 'hms' num [1:128069] 01:09:00 00:14:00 00:00:00 11:47:00 ...
## ..- attr(*, "units")= chr "secs"
## $ SHOT_CLOCK : num [1:128069] 10.8 3.4 NA 10.3 10.9 9.1 14.5 3.4 12.4 17.4 ...
## $ DRIBBLES : num [1:128069] 2 0 3 2 2 2 11 3 0 0 ...
## $ TOUCH_TIME : num [1:128069] 1.9 0.8 2.7 1.9 2.7 4.4 9 2.5 0.8 1.1 ...
## $ SHOT_DIST : num [1:128069] 7.7 28.2 10.1 17.2 3.7 18.4 20.7 3.5 24.6 22.4 ...
## $ PTS_TYPE : num [1:128069] 2 3 2 2 2 2 2 2 3 3 ...
## $ SHOT_RESULT : chr [1:128069] "made" "missed" "missed" "missed" ...
## $ CLOSEST_DEFENDER : chr [1:128069] "Anderson, Alan" "Bogdanovic, Bojan" "Bogdanovic, Bojan" "Brown, Markel" ...
## $ CLOSEST_DEFENDER_PLAYER_ID: num [1:128069] 101187 202711 202711 203900 201152 ...
## $ CLOSE_DEF_DIST : num [1:128069] 1.3 6.1 0.9 3.4 1.1 2.6 6.1 2.1 7.3 19.8 ...
## $ FGM : num [1:128069] 1 0 0 0 0 0 0 1 0 0 ...
## $ PTS : num [1:128069] 2 0 0 0 0 0 0 2 0 0 ...
## $ player_name : chr [1:128069] "brian roberts" "brian roberts" "brian roberts" "brian roberts" ...
## $ player_id : num [1:128069] 203148 203148 203148 203148 203148 ...
## - attr(*, "spec")=
## .. cols(
## .. GAME_ID = col_double(),
## .. MATCHUP = col_character(),
## .. LOCATION = col_character(),
## .. W = col_character(),
## .. FINAL_MARGIN = col_double(),
## .. SHOT_NUMBER = col_double(),
## .. PERIOD = col_double(),
## .. GAME_CLOCK = col_time(format = ""),
## .. SHOT_CLOCK = col_double(),
## .. DRIBBLES = col_double(),
## .. TOUCH_TIME = col_double(),
## .. SHOT_DIST = col_double(),
## .. PTS_TYPE = col_double(),
## .. SHOT_RESULT = col_character(),
## .. CLOSEST_DEFENDER = col_character(),
## .. CLOSEST_DEFENDER_PLAYER_ID = col_double(),
## .. CLOSE_DEF_DIST = col_double(),
## .. FGM = col_double(),
## .. PTS = col_double(),
## .. player_name = col_character(),
## .. player_id = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(shots)
## # A tibble: 6 x 21
## GAME_ID MATCHUP LOCATION W FINAL_MARGIN SHOT_NUMBER PERIOD GAME_CLOCK
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <time>
## 1 21400899 MAR 04, 20~ A W 24 1 1 01:09
## 2 21400899 MAR 04, 20~ A W 24 2 1 00:14
## 3 21400899 MAR 04, 20~ A W 24 3 1 00:00
## 4 21400899 MAR 04, 20~ A W 24 4 2 11:47
## 5 21400899 MAR 04, 20~ A W 24 5 2 10:34
## 6 21400899 MAR 04, 20~ A W 24 6 2 08:15
## # ... with 13 more variables: SHOT_CLOCK <dbl>, DRIBBLES <dbl>,
## # TOUCH_TIME <dbl>, SHOT_DIST <dbl>, PTS_TYPE <dbl>, SHOT_RESULT <chr>,
## # CLOSEST_DEFENDER <chr>, CLOSEST_DEFENDER_PLAYER_ID <dbl>,
## # CLOSE_DEF_DIST <dbl>, FGM <dbl>, PTS <dbl>, player_name <chr>,
## # player_id <dbl>
After briefly exploring this data set, we see that it has 128,069 observations of 21 variables. The data dictionary below explains what each variable represents. Most columns are stored as doubles or character strings, and GAME_CLOCK is read as a time. The data set mixes numerical variables, such as shot distance, with categorical ones, such as whether the game was played at home or away.
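The summary output above also hints at two things worth checking before analysis: SHOT_CLOCK has 5,567 missing values, and the minimum TOUCH_TIME is negative. The following is a minimal sketch of how one might count such rows. It uses a small made-up data frame (`toy`, not the real data) that only borrows the column names.

```r
# Hypothetical sample with the same column names as shots; values are made up
toy <- data.frame(
  SHOT_CLOCK = c(10.8, 3.4, NA, 10.3),
  TOUCH_TIME = c(1.9, 0.8, -163.6, 1.9)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))

# Count negative touch times, which cannot be real durations
n_negative_touch <- sum(toy$TOUCH_TIME < 0, na.rm = TRUE)

na_counts
n_negative_touch
```

Running the same two expressions on `shots` would show how many rows are affected. Whether to drop or keep those rows is a judgment call that depends on the question being asked.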
This dictionary describes what each variable of the data set represents.
GAME_ID
Identification number of a particular NBA game.
MATCHUP
The date the game occurred as well as the two teams that faced off.
LOCATION
Whether the team was playing at home (H) or away (A).
W
Whether the team won (W) or lost (L).
FINAL_MARGIN
Final margin of victory or defeat for that game.
SHOT_NUMBER
The player's shot attempt number within the game.
PERIOD
The period in which the shot was attempted.
GAME_CLOCK
The game clock at the time the shot was attempted.
SHOT_CLOCK
The shot clock at the time the shot was attempted.
DRIBBLES
The number of dribbles prior to the shot attempted.
TOUCH_TIME
How long the player was holding the ball before shooting.
SHOT_DIST
How far away, in feet, the player was when shooting the ball.
PTS_TYPE
What kind of shot it was, either a 2 pointer or 3 pointer.
SHOT_RESULT
Whether the shot was made or missed.
CLOSEST_DEFENDER
Who the closest defender was when the shot was taken.
CLOSEST_DEFENDER_PLAYER_ID
Identification (number) assigned to the closest defender when the shot was taken.
CLOSE_DEF_DIST
Distance of the closest defender to the person who took that shot (in feet).
FGM
Whether the shot was a made field goal (1) or a miss (0).
PTS
Number of points awarded for the shot.
player_name
Name of the player who took the shot.
player_id
Identification number assigned to the player who took the shot.
Our research questions are what kinds of shots are taken (in terms of distance from the basket) and how often they are made at different times on the shot clock. In particular, we will analyze two players from the Golden State Warriors, Stephen Curry and Draymond Green.
The first step in cleaning this data set is to split the MATCHUP column in two. The original column holds both the date and the two teams playing, so we move the date into its own DATE column and convert it to a date type using lubridate. We also store the first three characters of the remaining MATCHUP string, the team abbreviation listed first, in a new TEAM column.
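shots <- shots %>%
  # Split "MAR 04, 2015 - CHA @ BKN" into a date part and a teams part
  separate(MATCHUP, into = c("DATE", "MATCHUP"), sep = " - ") %>%
  mutate(DATE = mdy(DATE),                # parse the text date with lubridate
         TEAM = str_sub(MATCHUP, 1, 3))   # abbreviation of the team listed first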
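To make the effect of this step concrete, here is a small sketch that applies the same calls to a two-row data frame. The name `toy` and the second row (with the placeholder teams AAA and BBB) are made up for illustration; only the first string is taken from the output shown earlier.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Hypothetical sample in the same format as the raw MATCHUP column
toy <- data.frame(
  MATCHUP = c("MAR 04, 2015 - CHA @ BKN", "JAN 01, 2015 - AAA vs. BBB"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  separate(MATCHUP, into = c("DATE", "MATCHUP"), sep = " - ") %>%
  mutate(DATE = mdy(DATE),               # character -> Date
         TEAM = str_sub(MATCHUP, 1, 3))  # first three characters

toy_clean
```

After the split, DATE is a proper Date column that can be sorted or filtered by range, and TEAM can be used directly in `filter()`.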
Another cleaning step is to reformat the player_name variable so that it is consistent with the CLOSEST_DEFENDER column, which is formatted as “Last name, First name”. Because player_name is formatted as “first last”, we capitalize it, reverse the two parts of the name, and add a comma between them. The rowwise() function makes the final mutate() operate on one row at a time, so each row's vector of name parts is collapsed into a single string.
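# Title-case the name, e.g. "brian roberts" -> "Brian Roberts"
shots$player_name <- str_to_title(shots$player_name)
# Split each name on spaces and reverse the order of the parts
shots$player_name <- str_split(shots$player_name, pattern = " ")
shots$player_name <- lapply(X = shots$player_name, FUN = rev)
# Collapse each row's parts back into one "Last, First" string
shots <- shots %>%
  rowwise() %>%
  mutate(player_name = str_c(player_name, collapse = ", ")) %>%
  ungroup()  # drop the row-wise grouping so later steps use a plain tibble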
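As an alternative sketch (not the method used in this report), the same “Last, First” format can be produced in one vectorized step with a regular expression. The helper name `flip_name` is our own invention. Note that it treats everything after the first word as the last name, so a three-part name such as "tim hardaway jr" would come out differently than with the split-and-reverse approach above.

```r
library(stringr)

# Hypothetical helper: "first last" -> "Last, First"
flip_name <- function(x) {
  x <- str_to_title(x)
  # Group 1 is the first word, group 2 is everything after it;
  # the replacement swaps them around a comma
  str_replace(x, "^(\\S+)\\s+(.+)$", "\\2, \\1")
}

flip_name(c("stephen curry", "draymond green"))
```

Because the function works on the whole column at once, it avoids the list column and the row-wise pass, which may matter on a table with more than 100,000 rows.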
To analyze the shots taken by Stephen Curry and Draymond Green against the time left on the shot clock, we use plotly so that the reader can hover over a data point and see the exact shot clock and shot distance at which the shot was taken. We first filter the data set to the Golden State Warriors and then to Curry and Green. We also color the points by player so it is easy to see who took each shot.
# Keep only Warriors shots taken by Curry or Green
shots2 <- shots %>%
  filter(TEAM == "GSW") %>%
  filter(player_name %in% c("Curry, Stephen", "Green, Draymond"))
# Interactive scatter plot (plotly package): shot clock vs. shot distance
plot_ly(shots2, x = ~SHOT_CLOCK, y = ~SHOT_DIST) %>%
  layout(title = "Curry vs. Green Shot Selection (2014-15)") %>%
  add_markers(color = ~player_name,
              text = ~paste0('Player: ', player_name))
## Warning: Ignoring 34 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
A few features of the visualization stand out. First, Stephen Curry took many shots across a wide range of distances, which suggests there is almost no kind of shot he is afraid to take. Draymond Green, a more defense-oriented player, took far fewer mid-range shots and shot mostly from three-point range and in the paint. An interesting cluster at the bottom right of the plot shows shots Green took very close to the basket with most of the shot clock remaining. This may suggest that Green anticipates turnovers better than Curry and is therefore available to receive outlet passes and shoot in the paint at the start of a possession.
The plot below is a histogram of the frequency of shots taken by distance from the basket, faceted by player for six players (Stephen Curry, LeBron James, James Harden, Kyrie Irving, DeMarcus Cousins and Anthony Davis). This lets us compare how these players distribute their shots across distances.
shots %>%
  # Names follow the "Last, First" title-case format created above
  filter(player_name %in% c("Curry, Stephen", "James, Lebron", "Harden, James",
                            "Irving, Kyrie", "Davis, Anthony", "Cousins, Demarcus")) %>%
  ggplot() +
  geom_histogram(aes(x = SHOT_DIST, fill = player_name)) +
  labs(title = "Shot Usage by Distance Among Top NBA Players (2014-15)",
       x = "Shot Distance (Feet)",
       y = "Shot Usage") +
  facet_wrap(~ player_name) +
  theme_clean() +  # from the ggthemes package
  theme(legend.position = "none")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
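The message above appears because no bin size was given, so ggplot2 fell back to 30 bins. A minimal sketch of setting the bin width explicitly is shown here on made-up distances. The `toy` data frame and the choice of 2-foot bins are assumptions for illustration, not values used in this report.

```r
library(ggplot2)

# Hypothetical shot distances in feet; values are made up
toy <- data.frame(SHOT_DIST = c(1.2, 2.5, 3.1, 14.8, 23.9, 24.3, 25.0, 26.7))

# binwidth = 2 makes every bar span two feet;
# boundary = 0 aligns the bin edges to 0, 2, 4, ...
p <- ggplot(toy, aes(x = SHOT_DIST)) +
  geom_histogram(binwidth = 2, boundary = 0) +
  labs(x = "Shot Distance (Feet)", y = "Shot Usage")
p
```

A fixed width in feet makes the bars easier to read against court landmarks than an arbitrary count of 30 bins, and it silences the message.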
Based on the figure above, we see that guards tend to shoot from three-point range more often than forwards. This is not surprising, as guards have different skill sets and responsibilities from forwards. What is interesting is that none of these players shoots from mid-range very often, which reflects how the game has changed in recent years. In addition, James Harden attempted more threes than Stephen Curry, even though Curry is the player most associated with the 3-point shot.
We want to compare the efficacy of 2-point and 3-point shots. Our goal is to answer two questions: what percentage of shots in the 2014-2015 season were 2-pointers versus 3-pointers, and which type of shot is the most successful?
To answer the first question, we counted the shots of each type and created a column, prop, that holds each type's proportion of all shots attempted. We then converted PTS_TYPE to a factor so that it is treated as discrete instead of continuous. Afterwards, we created a pie chart representing the proportions.
shotcount <- shots %>%
  group_by(PTS_TYPE) %>%
  summarize(countmm = n()) %>%
  mutate(prop = countmm / sum(countmm))
shotcount
## # A tibble: 2 x 3
## PTS_TYPE countmm prop
## <dbl> <int> <dbl>
## 1 2 94173 0.735
## 2 3 33896 0.265
shotcount$PTS_TYPE <- factor(shotcount$PTS_TYPE)
ggplot(shotcount, aes(x = "", y = prop, fill = PTS_TYPE)) +
  geom_bar(stat = "identity", width = 1) +
  ggtitle("Percentage of 2 point and 3 point Shots Attempted (2014-15)") +
  coord_polar("y", start = 0) +
  theme_void() +
  scale_fill_brewer(palette = "Set1")
Based on the figure, 73.5% of shots in the 2014-2015 NBA season were 2-pointers and 26.5% were 3-pointers. To explore further, we looked at how the shots that were actually made were split between 2-pointers and 3-pointers.
Because we are only focusing on made shots, we first filtered the data set to rows where SHOT_RESULT is “made”. The rest of the process is similar to the previous question; the only difference is that we drew a bar graph comparing the two shot types.
shotresult <- shots %>%
  filter(SHOT_RESULT == "made") %>%
  group_by(SHOT_RESULT, PTS_TYPE) %>%
  summarize(countmm = n()) %>%
  mutate(prop = countmm / sum(countmm))
## `summarise()` has grouped output by 'SHOT_RESULT'. You can override using the `.groups` argument.
shotresult1 <- shotresult %>%
  mutate(prop1 = prop * 100)
shotresult1
## # A tibble: 2 x 5
## # Groups: SHOT_RESULT [1]
## SHOT_RESULT PTS_TYPE countmm prop prop1
## <chr> <dbl> <int> <dbl> <dbl>
## 1 made 2 45990 0.794 79.4
## 2 made 3 11915 0.206 20.6
shotresult1$PTS_TYPE <- factor(shotresult1$PTS_TYPE)
ggplot(shotresult1, aes(x = PTS_TYPE, y = prop1, fill = PTS_TYPE)) +
  geom_bar(stat = "identity") +
  ggtitle("Percentage of 2 point and 3 point Shots Made (2014-15)") +
  xlab("Type of Points") +
  ylab("Percentage") +
  theme_economist() +  # from the ggthemes package
  theme(legend.title = element_blank())
Based on the figure, we found that 79.4% of all made shots were 2-pointers while 20.6% were 3-pointers, which shows that despite the rise of the 3-point shot, teams still relied heavily on the 2-point shot.
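The share of made shots by type depends on how many shots of each type were attempted, so on its own it does not say which shot is more successful. As a rough complement, the counts already reported in the two tables above can be combined into a make rate (made divided by attempted) and the points earned per attempt for each shot type. The vector names below are our own, introduced for illustration.

```r
# Counts copied from the shotcount and shotresult1 tables above
attempts <- c("2" = 94173, "3" = 33896)
made     <- c("2" = 45990, "3" = 11915)

# Make rate: fraction of attempts of each type that went in
fg_pct <- made / attempts

# Average points earned per attempt of each type
pts_per_attempt <- fg_pct * c(2, 3)

round(fg_pct, 3)
round(pts_per_attempt, 3)
```

On these counts, 2-pointers go in more often (about 48.8% versus about 35.2%), while 3-pointers return slightly more points per attempt (about 1.05 versus about 0.98). Both readings are consistent with the overall FGM and PTS means in the summary output.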